{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Matching Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Matching networks are yet another simple and efficient one-shot learning algorithm published by Google's DeepMind. It can even produce labels for the unobserved class in the dataset. Let us say we have a support set $S$ containing $K$ examples as ${(x_1,y_1),(x_2,y_2)...(x_k,y_k)}$. When given a query point (new unseen example) $\\hat{x}$, matching network predicts the class of $\\hat{x}$  by comparing it with the support set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can define this as $p(\\hat{y}|\\hat{x},S)$  where $p$ is the parameterized neural network, $\\hat{y}$ is the predicted class for the query point $\\hat{x}$ and  $S$ is the support set. $p(\\hat{y}|\\hat{x},S)$ will return the probability of $\\hat{x}$ belonging to each of the class in the support set. Then we select the class of  $\\hat{x}$ as the one which has highest probability. But how does this work exactly? How is this probability computed? Let's us see that now. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The class  $\\hat{y}$  of the query point $\\hat{x}$ can be predicted as,\n",
    "\n",
    "$\\hat{y} = \\sum_{i=1}^k a(\\hat{x},x_i)y_i $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's decipher this equation.  $x_i$ and $y_i$  are the input and labels of the support set.  $\\hat{x}$ is the query input that is the input to which we want to predict the label. $a$ is the attention mechanism between  $\\hat{x}$ and $x_i$ .  But how do we perform attention? Here we use simple attention mechanism which is softmax over the cosine distance between   $\\hat{x}$ and $x_i$  . "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "i.e $a(\\hat{x},x_i) = softmax(cosine(\\hat{x},x_i))$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can't calculate cosine distance between the raw inputs $\\hat{x}$ and $x_i$ directly. So, first, we will learn their embeddings and calculate the cosine distance between the embeddings. We use two different embeddings $f$ and  $g$ for learning the embeddings of $\\hat{x}$ and $x_i$  respectively. We will learn how exactly these two embedding functions $f$ and $g$ learn the embeddings in the upcoming section. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we can rewrite our attention equation as,\n",
    "\n",
    "$a(\\hat{x},x_i) = softmax (cosine (f(\\hat{x}),g(x_i)) $\n",
    "\n",
    "\n",
    "We can rewrite the above equation as, \n",
    "\n",
    "$a(\\hat{x},x_i) = \\frac{e^{cosine(f(\\hat{x}),g(x_i))}}{\\sum_{j=1}^k  e ^{cosine(f(\\hat{x}),g(x_j))}} $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, once after calculating the attention matrix $a(\\hat{x},x_i) $, we multiply our attention matrix with support set labels $y_i$ . But how can we multiply support set labels with our attention matrix? First, we convert our support set labels to the one hot encoded values and then multiply them with our attention matrix and as a result, we get the probability of our query point $\\hat{x}$ belonging to each of the class in the support set. Then we apply argmax and select  $\\hat{y}$ as the one which has a maximum probability value. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still not clear about matching networks? Look at the below figure as you can notice we have three classes in our support set {lion, elephant and dog} and we have a new query image  $\\hat{x}$. First, we feed the support set to embedding function  $g$ and query image to embedding function $f$ and learn their embeddings and calculate the cosine distance between them and then we apply softmax attention over this cosine distance. Then we multiply our attention matrix with the one hot encoded support set labels and get the probabilities and then we select $\\hat{y}$   as the one which has the highest probability. As you can notice in the below figure query set image is an elephant, and we have a high probability at the index 1, so we predict the class of $\\hat{y}$  as 1 (elephant). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "![title](Images/5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We explore more about matching networks in the next section. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
